Initializing PySpark for Hadoop Distribution
If you have configured a Hadoop Cluster (EMR/CDH) in the Chorus Data section, you can use PySpark to read and write HDFS data. This notebook can be exposed as a Python Execute operator which can be used by the downstream operators in a workflow.
This method is far more efficient than other methods for medium or large data sets, and is the only viable option for reading data sets larger than a few GB.
You can initialize and use PySpark in your Jupyter Notebooks for Team Studio.
Use the Initialize PySpark for Cluster function. This is required to accommodate Spark upgrades in the system.
Regenerate the PySpark context by clicking .
- Change the previously generated code to the following:
os.environ['PYSPARK_SUBMIT_ARGS'] = "--master yarn-client --num-executors 1 --executor-memory 1g --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.11:3.0.1 pyspark-shell"
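As a sketch, the same submit arguments can be assembled in plain Python before any PySpark import. The executor count, memory, and package versions mirror the snippet above and should be adjusted for your cluster:

```python
import os

# Assemble the --packages list separately for readability; the versions
# shown are the ones from the snippet above, not a recommendation.
packages = ",".join([
    "com.databricks:spark-csv_2.10:1.5.0",
    "com.databricks:spark-avro_2.11:3.0.1",
])

# PySpark reads PYSPARK_SUBMIT_ARGS when the JVM gateway launches, so set
# it before importing pyspark or creating the SparkContext.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--master yarn-client "
    "--num-executors 1 --executor-memory 1g "
    f"--packages {packages} pyspark-shell"
)
```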
If you do not have access to a Hadoop cluster, you can run your PySpark job in local mode. Before running PySpark in local mode, set the following configuration.
- Set the PYSPARK_SUBMIT_ARGS environment variable as follows: os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'
- Set the YARN_CONF_DIR environment variable as follows: os.environ['YARN_CONF_DIR'] = ''
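A minimal sketch of the local-mode setup, run before the SparkContext is created:

```python
import os

# Local mode runs Spark in-process, so no Hadoop cluster is needed.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'

# Clear YARN_CONF_DIR so Spark does not try to reach a YARN ResourceManager.
os.environ['YARN_CONF_DIR'] = ''
```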
Procedure
- Create a new notebook.
- Click .
- A SELECT DATA SOURCE TO CONNECT dialog appears. Select an existing Hadoop data source, and then click Add Data Source.

- A bit of code is inserted into your notebook. This facilitates communication between the data source and your notebook. If you want to read more data sources, repeat steps 1 to 3 (the maximum is 3 inputs to a Python Execute operator). To run this code, press Shift+Enter or click Run. Now, you can run other commands by referring to the comments in the inserted code.
The commands use the object cc, which is an instance of the class ChorusCommander with the required parameters for the methods to work correctly. The generated code sets the sqlContext argument to the initialized Spark session in the cc.read_input_file method call. You can set the spark_options dictionary argument to pass additional options to the Spark DataFrame reader for the CSV format.
- To read the data sets, uncomment the lines in the generated code for that data set, along with the _props variable having the corresponding spark_options.
- To use the notebook as a Python Execute operator, change the use_input_substitution parameter from False to True and add the execution_label parameter for the data sets to be read. The execution_label value should start from the string '1', followed by '2' and '3' for subsequent data sets. For more information, see help(cc.read_input_file).
- The generated cc.read_input_file method call returns a Spark DataFrame. You can modify, copy, or perform any other operations on the DataFrame as required.
- Once the required output Spark DataFrame has been created, write it to a target table using cc.write_output_file.
  - To enable the use of the output in downstream operators, set use_output_substitution=True.
  - To overwrite any existing files in the same path as the target, set the overwrite_exists parameter to True.
  - You can set the spark_options dictionary argument to pass additional options to the Spark DataFrame writer for the CSV format.
  For more information, see help(cc.write_output_file).
  Note:
  - If this is not a terminal operator in the workflow, do not set header=True, because subsequent operators use operator metadata and not the header line in the output file.
  - You can use a comma (,) as the delimiter argument in write_output_file for compatibility with other operators in a workflow, or omit the delimiter argument.
Run the notebook manually so that the metadata can be determined and the notebook can be used as a Python Execute operator in legacy workflows.
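As an illustration of the spark_options dictionary argument, the following sketch builds options for the CSV format. The keys shown (header, delimiter, inferSchema) are standard Spark CSV options; the variable name ds1_props follows the generated _props variables mentioned above:

```python
# Illustrative spark_options for the CSV format; adjust keys and values
# for your data. Set header only for terminal operators (see the note above).
ds1_props = {
    'header': 'true',       # treat the first line as column names
    'delimiter': ',',       # comma keeps output compatible with other operators
    'inferSchema': 'true',  # let Spark infer column types when reading
}
```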
Sample commands
Use the following sample commands to write a table:
cc.write_output_table(ds1_adult_csv, table_name='adult_outemr2_hadoop.csv', schema_name='Compute_s3_local',
database_name='', sqlContext=spark, spark_options=ds1_props,
overwrite_exists=True, drop_if_exists=True, use_output_substitution=True)
Use the following sample commands to write a file:
cc.datasource_name = 'EMR535_Large3'
cc.write_output_file(ds1_adult_csv, file_path='/tmp/adult_out.csv',
sqlContext=spark, file_type='csv', spark_options={}, overwrite_exists=True)